We have data about users who hit our site: whether they converted, as well as some of their characteristics, such as their country, the marketing channel, their age, whether they are repeat users, and the number of pages visited during that session (a proxy for site activity/time spent on site).
The dataset (conversion_data.csv) has the following columns:
Let's first identify the basic structure of the dataset. This includes the column names, what they stand for, their data types, the range of their values, and the scale of the dataset.
with open('conversion_data.csv') as f:
    for num, line in enumerate(f):
        if num > 5:
            break
        print(num, line)
import pandas as pd
df = pd.read_csv('conversion_data.csv')
df.describe(include='all')
df.head()
How many users are below 20 years old?
df[df.age<20]
How many users are above 70 years old?
df[df.age>70]
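If we only need the counts rather than the row listings above, a boolean mask can be summed directly; a quick sketch on made-up data:

```python
import pandas as pd

# Toy frame standing in for conversion_data.csv (the ages here are made up)
df = pd.DataFrame({'age': [19, 25, 45, 72, 123]})

# A boolean mask sums to the number of True entries,
# so "how many" questions don't require materializing the rows
under_20 = int((df.age < 20).sum())
over_70 = int((df.age > 70).sum())
print(under_20, over_70)  # 1 2
```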
It seems a little off to have users aged 123 and 111. When in doubt, it is usually safe to remove the "strange" records. In reality, though, we should talk to the software engineers who implemented the data collection to see whether these records were caused by bugs. If they were, the same bugs might have affected much more of the data.
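Rather than eyeballing a cutoff, an interquartile-range rule (Tukey's fences) can flag suspicious values automatically; a minimal sketch on made-up ages:

```python
import pandas as pd

# Made-up ages with two implausible values
ages = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 40, 111, 123])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [111, 123]
```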
df = df[df.age<80]
df.head()
Next, we will familiarize ourselves with the dataset. This step is called exploratory data analysis, and it typically happens before model building.
In the Data Science and Big Data Analytics course I helped build while at EMC Education Services, we talk a lot about how a data science project differs from a software engineering project and how it should follow a Data Analytics Lifecycle:

Exploratory data analysis happens during the second step, data preparation. You can learn more about the Data Analytics Lifecycle from this YouTube video, which was taken from the Video-ILT course we built.
Let's start by taking a look at the age distribution of our users.
from bokeh.charts import Histogram, output_notebook, show
from bokeh.resources import Resources
resource = Resources(mode='inline')
output_notebook(resources=resource)
p = Histogram(df['age'], bins=30, title="Age Distribution (30 bins)")
show(p)
You can see that the majority of our users are young, centered around 30, with most between 24 and 36.
Next, we will check the conversion rate by age to see whether age is a determining factor. As you can see from the graph below, conversion rate goes down as age increases, except for users around 60 years old.
As a next-step action item, we should talk to product and marketing and gather more data to find out: What happened to our older users? What did we do wrong? Why are users around 60 an exception? What can we learn from that?
grouped = df.loc[:, ['age', 'converted']].groupby('age')
data_age = grouped.mean()
data_age["Age"] = data_age.index
# data_age
from bokeh.charts import Line, output_notebook, show
output_notebook(resources=resource)
p = Line(data_age, x='Age', y='converted', color="blue", title="Conversion Rate by Age",
         plot_width=900, plot_height=400)
show(p)
Next, let's look at the conversion rate by country. Germany has a much higher conversion rate than the other countries, and China has a far lower one. As a next step, we should find out why we did so poorly with our Chinese users. Is it because the site is not localized well, or perhaps because our payment system does not work well in China? We should also look at why we did so well in Germany and whether we can apply what we learned to other countries.
from bokeh.charts import Bar, output_notebook, show
output_notebook(resources=resource)
p = Bar(df, 'country', values='converted', agg='mean', title="Conversion Rate by Country")
show(p)
Combining the conversion rate graph with the converted counts reveals another interesting fact. The site works very well for Germany in terms of conversion, but the converted counts show that far fewer Germans come to the site than UK users, despite Germany's larger population. Marketing should therefore invest in acquiring more German users; there is a big opportunity there.
import numpy as np
grouped = df.loc[:, ['country', 'converted']].groupby('country')
data_country = grouped.sum()
data_country
We can plot the converted counts by country using matplotlib in XKCD style.
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
ind = np.arange(len(df.country.unique()))
width = 0.5
plt.xkcd()
# fig = plt.figure()
fig, ax = plt.subplots()
ax.bar(ind, data_country.converted, width, color="black")
ax.set_title("Converted Count by Country")
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(data_country.index)
As you can see below, source does not appear to be a significant factor in conversion rate.
output_notebook(resources=resource)
p = Bar(df, 'source', values='converted', agg='mean', color="wheat", title="Conversion Rate by Source")
show(p)
The following plot shows that existing users have a much higher conversion rate than new users. We should therefore focus marketing on re-engaging existing users.
output_notebook(resources=resource)
p = Bar(df, 'new_user', values='converted', agg='mean', color="green", title="Conversion Rate by New User")
show(p)
Next, let's look at the relationship between total_pages_visited and conversion rate. The plot shows that starting at around 8 pages, conversion rate increases significantly with total_pages_visited. Once users browse 18 pages or more, we can be fairly confident they will buy from us.
grouped = df.loc[:, ['total_pages_visited', 'converted']].groupby('total_pages_visited')
#grouped.groups
#grouped.sum()
data_pages = grouped.aggregate(np.mean)
data_pages
output_notebook(resources=resource)
p = Line(data_pages, title="Conversion Rate vs Total Pages Visited", legend="top_left", ylabel="Conversion Rate")
show(p)
Now we have a good understanding of the dataset, and we can apply a model to predict conversion. I'm going to choose Random Forests because it usually requires very little time to optimize (its default parameters are often close to the best ones), and it is robust to outliers and irrelevant variables while handling both continuous and discrete variables.
I will use the random forest to predict conversion and plot the out-of-bag error in relation to the number of estimators. I will also use variable importance to get insight into how the random forest extracts information from the variables.
country_encoded, country_index = pd.factorize(df['country'])
df['country_encoded'] = country_encoded
source_encoded, source_index = pd.factorize(df['source'])
df['source_encoded'] = source_encoded
df.head()
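To see what pd.factorize is doing here, a tiny sketch (the country values below are illustrative):

```python
import pandas as pd

# pd.factorize assigns integer codes in order of first appearance
# and returns the unique labels alongside them
codes, uniques = pd.factorize(pd.Series(['UK', 'US', 'UK', 'China']))
print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['UK', 'US', 'China']
```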
from sklearn.model_selection import train_test_split
# split 80/20 train-test
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ['country_encoded', 'age', 'new_user', 'source_encoded', 'total_pages_visited']],
    df.converted,
    test_size=0.2,
    random_state=1)
x_train.columns
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf.fit(x_train, y_train)
clf.oob_score_
clf.n_features_in_
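The OOB score above estimates generalization accuracy without a separate validation set: each tree is scored only on the bootstrap rows it never saw during training. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the conversion dataset)
X, y = make_classification(n_samples=1000, random_state=0)

# oob_score=True asks each tree to be evaluated on its out-of-bag rows
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # roughly comparable to held-out accuracy
```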
importance = pd.DataFrame({"feature": pd.Categorical(x_train.columns), "importance": clf.feature_importances_})
output_notebook(resources=resource)
p = Bar(importance, label="feature", values="importance", color="orange", title="Feature importance")
show(p)
Total pages visited is by far the most important feature. Unfortunately, it is probably the least "actionable": people visit many pages because they already want to buy, and in order to buy, you have to click through multiple pages.
preds = clf.predict(x_test)
pd.crosstab(y_test, preds, rownames=['actual'], colnames=['preds'])
We will use several metrics to measure the performance of our Random Forest classifier, including accuracy, the confusion matrix, the ROC curve (FPR vs. TPR), and AUC.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
print ("Accuracy:", accuracy_score(y_test, preds) )
print ("Confusion Matrix:\n", confusion_matrix(y_test, preds) )
fpr, tpr, thresholds = roc_curve(y_test, preds)
fpr
tpr
thresholds
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Random Forests')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
auc(fpr, tpr)
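Note that roc_curve above was fed hard 0/1 predictions, which yields only a couple of thresholds. Feeding it class probabilities from predict_proba produces a much smoother curve; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# Synthetic stand-in data (not the conversion dataset)
X, y = make_classification(n_samples=500, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)

# Probabilities for the positive class give many thresholds, hence a smoother ROC curve
scores = clf.predict_proba(x_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(len(thresholds), auc(fpr, tpr))
```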
from bokeh.plotting import figure, show, output_notebook
output_notebook(resources=resource)
p = figure(title="Receiver Operating Characteristic",
y_range=(0.0, 1.05))
p.line(fpr, tpr, legend="Random Forests")
show(p)
%matplotlib inline
import matplotlib.pyplot as plt
from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
plt.figure(1, figsize=(12, 8))
RANDOM_STATE = 123
NTREES = 100
# split 80/20 train-test
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ['country_encoded', 'age', 'new_user', 'source_encoded', 'total_pages_visited']],
    df.converted,
    test_size=0.2,
    random_state=RANDOM_STATE)
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
     RandomForestClassifier(warm_start=True, n_estimators=NTREES, oob_score=True,
                            max_features="sqrt",
                            random_state=RANDOM_STATE))
]
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 200
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(x_train, y_train)
        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
# plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
plt.figure(1, figsize=(12, 8))
RANDOM_STATE = 123
NTREES = 100
# split 80/20 train-test
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ['country_encoded', 'age', 'new_user', 'source_encoded', 'total_pages_visited']],
    df.converted,
    test_size=0.2,
    random_state=RANDOM_STATE)
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
     RandomForestClassifier(warm_start=True, n_estimators=NTREES, oob_score=True,
                            max_features="sqrt",
                            random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
     RandomForestClassifier(warm_start=True, n_estimators=NTREES, max_features='log2',
                            oob_score=True,
                            random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
     RandomForestClassifier(warm_start=True, n_estimators=NTREES, max_features=None,
                            oob_score=True,
                            random_state=RANDOM_STATE))
]
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 200
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(x_train, y_train)
        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
# plt.show()
As mentioned before, total_page_visited is the most important variable but probably the least “actionable”. Thus we will remove this variable and rebuild the random forests.
from sklearn.model_selection import train_test_split
# split 80/20 train-test
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ['country_encoded', 'age', 'new_user', 'source_encoded']],
    df.converted,
    test_size=0.2,
    random_state=1)
x_train.columns
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
importance = pd.DataFrame({"feature": pd.Categorical(x_train.columns), "importance": clf.feature_importances_})
from bokeh.charts import Bar, output_notebook, show
output_notebook(resources=resource)
p = Bar(importance, label="feature", values="importance", color="gray", title="Feature importance")
show(p)
preds = clf.predict(x_test)
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
print ("Accuracy:", accuracy_score(y_test, preds) )
print ("Confusion Matrix:\n", confusion_matrix(y_test, preds) )
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, x_test, y_test)
scores.mean()
from sklearn.model_selection import train_test_split
# split 80/20 train-test
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ['country_encoded', 'age', 'new_user', 'source_encoded']],
    df.converted,
    test_size=0.2,
    random_state=1)
x_train.columns
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=10, min_samples_split=10, max_leaf_nodes=10)
clf = clf.fit(x_train, y_train)
importance = pd.DataFrame({"feature": pd.Categorical(x_train.columns), "importance": clf.feature_importances_})
output_notebook(resources=resource)
p = Bar(importance, label="feature", values="importance", color="orange", title="Feature importance of Decision Trees")
show(p)
np.unique(y_train.values)
from io import StringIO
from IPython.display import Image
import numpy as np
import pydot_ng as pydot
classnames = np.unique(y_train.values)
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=x_train.columns,
                     class_names=["NotConverted", "Converted"],
                     filled=True, rounded=True,
                     special_characters=False)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Similar to the random forest, the decision tree classifies everything as the majority class, not converted. After all, our dataset starts from a baseline accuracy of about 97% (that's what we would get by classifying everything as "not converted"). To address this, we should introduce additional features to enrich the dataset and the analysis, and try out more classifiers.
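The ~97% baseline comes straight from the class balance: always predicting the majority class already scores that high. A sketch with made-up labels:

```python
import numpy as np

# Toy labels mimicking a ~3% conversion rate (made-up numbers)
y = np.array([0] * 97 + [1] * 3)

# Accuracy of always predicting the majority class
baseline = max(np.mean(y == 0), np.mean(y == 1))
print(baseline)  # 0.97
```

Any classifier worth deploying on imbalanced data like this needs to beat that number, which is why accuracy alone is a weak metric here.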